Exploiting Soft and Hard Correlations in Big Data Query Optimization

نویسندگان

  • Hai Liu
  • Dongqing Xiao
  • Pankaj Didwania
  • Mohamed Y. Eltabakh
چکیده

Big data infrastructures are increasingly supporting datasets that are relatively structured. These datasets are full of correlations among their attributes, which if managed in systematic ways would enable optimization opportunities that otherwise will be missed. Unlike relational databases in which discovering and exploiting the correlations in query optimization have been extensively studied, in big data infrastructures, such important data properties and their utilization have been mostly abandoned. The key reason is that domain experts may know many correlations but with a degree of uncertainty (fuzziness or softness). Since the data is big, it is very challenging to validate such correlations, judge their worthiness, and put strategies for utilizing them in query optimization. Existing techniques for exploiting soft correlations in RDBMSs, e.g., BHUNT, CORDS, and CM, are heavily tailored towards optimizing factors inherent in relational databases, e.g., predicate selectivity and random I/O accesses of secondary indexes, which are issues not applicable to big data infrastructures, e.g., Hadoop. In this paper, we propose the EXORD system to fill in this gap by exploiting the data’s correlations in big data query optimization. EXORD supports two types of correlations; hard correlations—which are guaranteed to hold for all data records, and soft correlations—which are expected to hold for most, but not all, data records. We introduce a new three-phase approach for (1) Validating and judging the worthiness of soft correlations, (2) Selecting and preparing the soft correlations for deployment by specially handling the violating data records, and (3) Deploying and exploiting the correlations in query optimization. We propose a novel cost-benefit model for adaptively selecting the most beneficial soft correlations w.r.t a given query workload while minimizing the introduced overhead. We show the complexity of this problem (NP-Hard), and propose a heuristic to efficiently solve it in a polynomial time. EXORD can be integrated with various state-of-art big data query optimization techniques, e.g., indexing and partitioning. EXORD prototype is implemented as an extension to the Hive engine on top of Hadoop. The experimental evaluation shows the potential of EXORD in achieving more than 10x speedup while introducing minimal storage overheads. ∗This project is partially supported by NSF-CRI 1305258 grant. This work is licensed under the Creative Commons AttributionNonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Proceedings of the VLDB Endowment, Vol. 9, No. 12 Copyright 2016 VLDB Endowment 2150-8097/16/08.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Relational Databases Query Optimization using Hybrid Evolutionary Algorithm

Optimizing the database queries is one of hard research problems. Exhaustive search techniques like dynamic programming is suitable for queries with a few relations, but by increasing the number of relations in query, much use of memory and processing is needed, and the use of these methods is not suitable, so we have to use random and evolutionary methods. The use of evolutionary methods, beca...

متن کامل

Discovering and Exploiting Statistical Properties for Query Optimization in Relational Databases: A Survey

Discovering and exploiting statistical features in relational datasets is key to query optimization in a relational database management system (rdbms), and is also needed for database design, cleaning, and integration. This paper surveys a variety of methods for automatically discovering important statistical features such as correlations, functional dependencies, keys, and algebraic constraint...

متن کامل

Comparative Study of Multi-query Optimization Techniques using Shared Predicate-based for Big Data

Big data analytical systems, such as MapReduce, have become main issues for many enterprises and research groups. Currently, multi-query which translated into MapReduce jobs is submitted repeatedly with similar tasks. So, exploiting these similar tasks can offer possibilities to avoid repeated computations of MapReduce jobs. Therefore, many researches have addressed the sharing opportunity to o...

متن کامل

Exploiting Opportunistic Physical Design in Large-scale Data Analytics

Big data analytical systems, such as MapReduce, perform aggressive materialization of intermediate job results in order to support fault tolerance. When jobs correspond to exploratory queries submitted by data analysts, these materializations yield a large set of materialized views that typically capture common computation among successive queries from the same analyst, or even across queries o...

متن کامل

Workload-Driven Vertical Partitioning for Effective Query Processing over Raw Data

Traditional databases are not equipped with the adequate functionality to handle the volume and variety of “Big Data”. Strict schema definition and data loading are prerequisites even for the most primitive query session. Raw data processing has been proposed as a schema-on-demand alternative that provides instant access to the data. When loading is an option, it is driven exclusively by the cu...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • PVLDB

دوره 9  شماره 

صفحات  -

تاریخ انتشار 2016